8-21-17

We are going to talk about this data set.

df <- read.csv("~/Documents/school/info3130/data/fraudclaims.csv")

Practice using Excel…

Answer these questions:

  • What was the average claim amount paid in each country? (use Subtotals in Excel)
  • What percent of claims were denied in each country?
  • On a percentage basis, how much more (or less) money in claims was paid in May than in June?
  • Which countries pay the most claims?

# I'll answer them in R...
# What was the average claim amount paid in each country?
df_sub <- df[df$Claim_Paid == 1, ]
aggregate(as.numeric(Amt_Paid) ~ Country, df_sub, mean)
##          Country as.numeric(Amt_Paid)
## 1      Australia             124.0000
## 2         Canada             263.7143
## 3         France             208.2000
## 4        Germany             127.5714
## 5          Italy             248.1429
## 6    Switzerland             392.6000
## 7 United Kindgom             165.8000
## 8  United States             377.8000
# What percent of claims were denied in each country?
prop.table(table(df$Country[df$Claim_Paid == 0]))
## 
##      Australia         Canada         France        Germany          Italy 
##     0.17948718     0.12820513     0.23076923     0.10256410     0.15384615 
##    Switzerland United Kindgom  United States 
##     0.10256410     0.02564103     0.07692308
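A caveat on the denial question: `prop.table(table(...))` above gives each country's share of all denied claims (the values sum to 1 across countries), not the percent of each country's own claims that were denied. For the latter, row-wise proportions are needed. A minimal sketch with made-up data, assuming the same `Country` and `Claim_Paid` (1 = paid, 0 = denied) columns:

```r
# Made-up claims data for illustration only
toy <- data.frame(
  Country    = c("Canada", "Canada", "Canada", "France", "France"),
  Claim_Paid = c(1, 0, 0, 1, 0)
)
# margin = 1 makes each row (country) sum to 1; the "0" column is the denial rate
denial_rate <- prop.table(table(toy$Country, toy$Claim_Paid), margin = 1)[, "0"]
denial_rate
# Canada: 2 of 3 denied; France: 1 of 2 denied
```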
# On a percentage basis, how much more (or less) money in claims was paid in May than in June?
# Which country pays the most claims?
# Need to be more specific....

Homework Assignment

Outline CRISP-DM: what are the processes for analyzing data? You can do a picture or… whatever.

  • Internet usage data. Survey about how people use the internet… Questions are on one of the sheets in the excel doc.

8-28-17

(class discussion about data understanding and asking questions about our data.)

When I don’t have a million variables, I like to look at the data first.

df <- read.csv("./data/HeatingOil.csv")
library(GGally)
ggpairs(df)

# using Excel: first install the Analysis ToolPak -> Descriptive Statistics
# or, using R, we can just run this:
summary(df)
##    Insulation      Temperature     Heating_Oil    Num_Occupants   
##  Min.   : 2.000   Min.   :38.00   Min.   :114.0   Min.   : 1.000  
##  1st Qu.: 4.000   1st Qu.:49.00   1st Qu.:148.2   1st Qu.: 2.000  
##  Median : 6.000   Median :60.00   Median :185.0   Median : 3.000  
##  Mean   : 6.214   Mean   :65.08   Mean   :197.4   Mean   : 3.113  
##  3rd Qu.: 9.000   3rd Qu.:81.00   3rd Qu.:253.0   3rd Qu.: 4.000  
##  Max.   :10.000   Max.   :90.00   Max.   :301.0   Max.   :10.000  
##     Avg_Age        Home_Size    
##  Min.   :15.10   Min.   :1.000  
##  1st Qu.:29.70   1st Qu.:3.000  
##  Median :42.90   Median :5.000  
##  Mean   :42.71   Mean   :4.649  
##  3rd Qu.:55.60   3rd Qu.:7.000  
##  Max.   :72.20   Max.   :8.000

Homework assignment: in your paper you can pose the questions you would want to ask if you were asked to analyze the data.

9-11-17

Talking about correlation today. Data set will be the Heating Oil example.

oil <- read.csv("~/Documents/school/info3130/data/HeatingOil.csv")
cor(oil)
##                Insulation Temperature Heating_Oil Num_Occupants
## Insulation     1.00000000 -0.79369606  0.73609688   -0.01256684
## Temperature   -0.79369606  1.00000000 -0.77365974    0.01251864
## Heating_Oil    0.73609688 -0.77365974  1.00000000   -0.04163508
## Num_Occupants -0.01256684  0.01251864 -0.04163508    1.00000000
## Avg_Age        0.64298171 -0.67257949  0.84789052   -0.04803415
## Home_Size      0.20071164 -0.21393926  0.38119082   -0.02253438
##                   Avg_Age   Home_Size
## Insulation     0.64298171  0.20071164
## Temperature   -0.67257949 -0.21393926
## Heating_Oil    0.84789052  0.38119082
## Num_Occupants -0.04803415 -0.02253438
## Avg_Age        1.00000000  0.30655725
## Home_Size      0.30655725  1.00000000
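When a correlation matrix gets big, sorting one column of it makes the strongest relationships easy to spot. A small sketch with made-up variables (`a`, `b`, `c` are hypothetical, not from the HeatingOil data):

```r
# b is built to track a closely; c is independent noise
set.seed(42)
toy <- data.frame(a = rnorm(50))
toy$b <- toy$a + rnorm(50, sd = 0.3)
toy$c <- rnorm(50)
# One column of the correlation matrix, strongest first
sort(cor(toy)[, "a"], decreasing = TRUE)
```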

Let’s see if I can remember the expected-value form of covariance… Here’s one way I remember it:

\[Cov(x, y) = E(XY) - E(X)E(Y) = E[(X - \mu_x)(Y - \mu_y)]\]

\[Cov(x, y) = E[XY - X\mu_y - Y\mu_x + \mu_x \mu_y] = E(XY) -\mu_y E(X) - \mu_x E(Y) + E(\mu_x \mu_y)\]

Now we should be able to see some of these terms disappear, remembering that \(E(X) = \mu_x\) and \(E(Y) = \mu_y\).

\[E(XY) - \mu_y \mu_x - \mu_x \mu_y + \mu_x \mu_y = E(XY) - \mu_x \mu_y\]

This is the covariance. To get the correlation we divide the covariance by \(\sigma_x \sigma_y\), I believe.
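Both identities are easy to sanity-check numerically on made-up data. Note that R's `cov()` divides by \(n - 1\), while the \(E(XY) - E(X)E(Y)\) form is the population covariance (divide by \(n\)), hence the scaling factor below:

```r
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)

# correlation = covariance / (sd_x * sd_y)
all.equal(cor(x, y), cov(x, y) / (sd(x) * sd(y)))   # TRUE

# population covariance E(XY) - E(X)E(Y) vs. R's sample covariance
n <- length(x)
all.equal(mean(x * y) - mean(x) * mean(y), cov(x, y) * (n - 1) / n)   # TRUE
```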

Back to the example. Just some visualizations.

library(ggplot2)
library(plotly)
gg <- ggplot(data = oil, aes(x = Avg_Age, y = Heating_Oil, color = Insulation))

gg <- gg + geom_point()# + geom_jitter()

# get an interactive plot:
ggplotly(gg)

Now to calculate the correlations.

cor(oil)
##                Insulation Temperature Heating_Oil Num_Occupants
## Insulation     1.00000000 -0.79369606  0.73609688   -0.01256684
## Temperature   -0.79369606  1.00000000 -0.77365974    0.01251864
## Heating_Oil    0.73609688 -0.77365974  1.00000000   -0.04163508
## Num_Occupants -0.01256684  0.01251864 -0.04163508    1.00000000
## Avg_Age        0.64298171 -0.67257949  0.84789052   -0.04803415
## Home_Size      0.20071164 -0.21393926  0.38119082   -0.02253438
##                   Avg_Age   Home_Size
## Insulation     0.64298171  0.20071164
## Temperature   -0.67257949 -0.21393926
## Heating_Oil    0.84789052  0.38119082
## Num_Occupants -0.04803415 -0.02253438
## Avg_Age        1.00000000  0.30655725
## Home_Size      0.30655725  1.00000000

9-19-17 Chapter 5

# following along in chapter 5 
df <- read.csv("~/Documents/school/info3130/data/Chapter05DataSet.csv"
               , colClasses = "factor")
df <- df[6:12]
df[df == 0] <- NA
library(arules)
rules <- apriori(df, parameter = list(minlen=2, supp=0.2,
                                          conf=0.5))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5     0.2      2
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 696 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[7 item(s), 3483 transaction(s)] done [0.00s].
## sorting and recoding items ... [4 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [4 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
inspect(rules)
##     lhs              rhs           support   confidence lift     count
## [1] {Hobbies=1}   => {Religious=1} 0.2388745 0.7961722  1.901967 832  
## [2] {Religious=1} => {Hobbies=1}   0.2388745 0.5706447  1.901967 832  
## [3] {Family=1}    => {Religious=1} 0.2245191 0.5758468  1.375634 782  
## [4] {Religious=1} => {Family=1}    0.2245191 0.5363512  1.375634 782
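The numbers in a rule row tie together: support = count / number of transactions, and lift = confidence / support(rhs). Checking rule [1] against the output above (the support of `{Religious=1}` is derived from the printed numbers, not printed itself):

```r
n     <- 3483   # transactions reported by apriori above
count <- 832    # count for rule [1]
supp  <- count / n          # matches the printed support, 0.2388745

conf <- 0.7961722
lift <- 1.901967
supp_rhs <- conf / lift     # implied support of {Religious=1}, about 0.4186
```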

9-25-17 Chapter 6

df <- read.csv("~/Documents/school/info3130/data/Chapter06DataSet.csv")
km <- kmeans(df[-3], 3) # look at k plot
data.frame(clust_size = km$size, km$centers)
##   clust_size   Weight Cholesterol
## 1        191 110.4607    125.9791
## 2        185 141.9946    173.2486
## 3        171 182.2632    217.0409
car::spm(df, diagonal = "histogram", 
         reg.line = NULL, smoother = NULL)
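A k-means centroid is just the per-variable mean of the points assigned to that cluster, which can be computed by hand. A minimal sketch with made-up points and a hypothetical cluster assignment (column names borrowed from this data set):

```r
# Three made-up points and a hypothetical assignment of each to a cluster
pts <- data.frame(Weight = c(110, 120, 180), Cholesterol = c(120, 130, 220))
grp <- c(1, 1, 2)
# Centroid of each cluster = column means over its members
centroids <- aggregate(pts, by = list(cluster = grp), FUN = mean)
centroids
# cluster 1 centroid: Weight 115, Cholesterol 125; cluster 2: 180, 220
```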

Visualize a different way (not using principal components): first plot the data itself, then the cluster assignments.

km <- kmeans(df[,c("Weight", "Cholesterol", "Gender")], centers = 4)
# first plot the data with color overlay:
df$Gender <- ifelse(df$Gender == 0, "female", "male")
gg <- ggplot(data = df, aes(x = Weight, y = Cholesterol, color = Gender))
gg + geom_point()

# like in the book:
df$Kgroup <- as.factor(km$cluster)
gg <- ggplot(data = df, aes(x = Weight, y = Cholesterol, color = Kgroup))
gg + geom_point()

OK fine, looks like it is grouping OK.

Where k-means goes wrong

Here’s some data.

library(ggplot2)
df <- data.frame(x = rnorm(361), y = rnorm(361))
theta <- seq(0, 360, 1) * (pi / 180)
x <- 6 * cos(theta) + rnorm(361, sd = 0.25)
y <- 6 * sin(theta) + rnorm(361, sd = 0.25)
df <- rbind(df, data.frame(x, y))
ggplot(data = df, aes(x = x, y = y)) + geom_point()

There are two distinct clusters.

K-means gets it wrong.

km <- kmeans(df, centers = 2)
df$cluster <- as.factor(km$cluster)
ggplot(data = df, aes(x = x, y = y, color = cluster)) + geom_point() + 
  ggtitle("K-means Clustering") + theme(legend.title = element_blank())

Hierarchical clustering gets it right.

d <- dist(df[-3])
hc <- hclust(d = d, method = "single")
memb <- cutree(hc, k = 2)
df$hclust <- as.factor(memb)
ggplot(data = df, aes(x = x, y = y, color = hclust)) + geom_point() + 
  ggtitle("Hierarchical Clustering") + theme(legend.title = element_blank())

9-25-17

df <- read.csv("~/Documents/school/info3130/data/uncatStudents.csv")
df_ab <- as.data.frame(df[, "Absences"])
names(df_ab) <- "Absences"
mean(df$Absences)
## [1] 2.519618
# we HAVE to consider outliers when using K-means
hist(df$Absences, col = "green", xlab = "Absences", main = "")

km2 <- kmeans(df_ab, centers = 2)
km2$centers
##   Absences
## 1 1.022388
## 2 6.194139
# what about 4 clusters?
km4 <- kmeans(df_ab, centers = 4)
km4$centers
##    Absences
## 1 9.8571429
## 2 3.6450000
## 3 6.8742515
## 4 0.6423488

How many clusters should there be?

n <- 8
wss <- rep(0, n)
for (i in 1:n){
  wss[i] <- sum(kmeans(df[-(1:3)], centers = i)$withinss)
}
plot(wss, type = "b", xlab = "k", ylab = "Within groups SS", 
     xlim = c(1, n + 0.5))
text(1:n, wss, pos = 4, round(wss,3), cex = 0.65)

K should probably be 2. This is nice since we can visualize this using principal components…

k <- 4
pc <- prcomp(df[-(1:3)])
km <- kmeans(df[, -(1:3)], centers = k)
temp <- data.frame(x = pc$x[, 1], y = pc$x[, 2], z = factor(km$cluster))
ggplot(data = temp, aes(x = x, y = y, color = z)) + geom_point() + 
  theme(legend.title = element_blank()) + ggtitle(paste("K =", k))

Test coming next week.

He does not do T/F or MC

5 or 6 questions for this one.

4 will be hands-on-the-keyboard, applied-knowledge questions:

  • Descriptive stats: central tendency, dispersion, \(2 \sigma\) (outliers)
  • Correlation: Pearson, Spearman, Kendall
  • Association rules: you need to have binomial variables
  • K-means (today): centroids by hand
  • CRISP-DM: what are the 6 parts and what order?

For the test we will download a word doc.